This tutorial shows how to apply an RNN to text classification, using the padding and bucketing tricks to handle sequences of varying length efficiently with the MXNet R package.

The example is based on sentiment analysis of the IMDB data.

Load some packages

require("readr")
require("dplyr")
require("plotly")
require("stringr")
require("stringi")
require("AUC")
require("scales")
require("mxnet")

Load utility functions

source("mx.io.bucket.iter.seq_len.R")
source("rnn.unroll.R")
source("rnn.R")
source("rnn.train.R")

Motivation for bucketing

Whether we’re working with text at the character or word level, NLP tasks naturally involve dealing with sequences of varying length.

This presents a challenge, as the symbolic representation of an RNN model assumes a fixed-length sequence. It can be circumvented in two ways:

  • Padding: fill the modeled sequences with an arbitrary word/character up to the length of the longest sequence. This results in sequences of equal length, but potentially excessively long for efficient training.

  • Bucketing: apply the padding trick, but to subgroups of samples split according to their lengths. This results in multiple training sets, or buckets, within which all samples are padded to the same length.
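The two tricks can be illustrated in a few lines of base R (the toy sequences and the bucket lengths below are made up for the example; the real corpus uses the full IMDB reviews):

```r
# Three toy tokenized sequences of different lengths
seqs <- list(c(4L, 7L, 2L), c(9L, 1L), c(3L, 8L, 5L, 6L, 2L, 7L))

# Padding: right-pad every sequence with 0 up to the longest one
max_len <- max(lengths(seqs))
padded <- t(sapply(seqs, function(s) c(s, rep(0L, max_len - length(s)))))

# Bucketing: assign each sequence to a length bucket, then pad only
# up to that bucket's length
bucket_lens <- c(3L, 6L)  # bucket 1: len <= 3, bucket 2: len <= 6
bucket_id <- findInterval(lengths(seqs), bucket_lens + 1L) + 1L
buckets <- lapply(seq_along(bucket_lens), function(b) {
  t(sapply(seqs[bucket_id == b],
           function(s) c(s, rep(0L, bucket_lens[b] - length(s)))))
})
```

With plain padding every sample becomes length 6; with bucketing the two short sequences only pay for length 3.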

Bucketing involves training with multiple graphs, one for each bucket. Nonetheless, all these graphs are used to train the same set of parameters, since RNNs, being recursive (!), reuse the same parameters for every element of the sequence regardless of how many there are.

In regular training, the symbol is bound once to the executors, and then at each iteration the data and the parameters are updated. With bucketing, a new symbol is bound to the executors at each iteration.

For example, if we were dealing with sequences of length 3 and 6, the following two graphs would be defined prior to training:

rnn_graph_3 <- rnn.unroll.cudnn(seq.len=3, 
                                num.rnn.layer = 2, 
                                num.hidden = 20,
                                input.size=100,
                                num.embed=16, 
                                num.label=2,
                                dropout=0.5, 
                                ignore_label = 0,
                                cell.type="gru",
                                config = "seq-to-one")

rnn_graph_6 <- rnn.unroll.cudnn(seq.len=6, 
                                num.rnn.layer = 2, 
                                num.hidden = 20,
                                input.size=100,
                                num.embed=16, 
                                num.label=2,
                                dropout=0.5, 
                                ignore_label = 0,
                                cell.type="gru",
                                config = "seq-to-one")

graph.viz(rnn_graph_3, type = "graph", direction = "LR", graph.height.px = 200, graph.width.px = 700, shape=c(3, 32))
graph.viz(rnn_graph_6, type = "graph", direction = "LR", graph.height.px = 250, graph.width.px = 700, shape=c(6, 32))

Then, during training, the iterator feeds a new batch of length 3 or 6, plus a bucket ID indicating the appropriate graph to bind to the executors at that iteration.
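The iterator’s behaviour can be sketched in base R (an illustrative simplification, not the actual mx.io.bucket.iter implementation): each step samples a bucket, weighted by how many samples it holds, and returns a batch together with its bucket ID.

```r
# Simplified bucket-iterator step: pick a bucket at random (weighted by
# bucket size), slice a batch from it, and report which bucket it came
# from so the training loop can bind the matching unrolled graph.
next_batch <- function(buckets, batch.size) {
  sizes <- sapply(buckets, nrow)
  id <- sample(seq_along(buckets), 1, prob = sizes / sum(sizes))
  idx <- sample(nrow(buckets[[id]]), min(batch.size, nrow(buckets[[id]])))
  list(bucket.id = id, data = buckets[[id]][idx, , drop = FALSE])
}

set.seed(42)
toy_buckets <- list(matrix(0L, 10, 3), matrix(0L, 5, 6))  # two toy buckets
batch <- next_batch(toy_buckets, batch.size = 4)
```

The returned `bucket.id` is what tells the trainer which of the pre-built graphs (e.g. `rnn_graph_3` vs `rnn_graph_6`) to bind for this batch.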

Data preparation

The loaded data has been pre-processed into lists whose elements are the buckets containing the samples and their associated labels.

This pre-processing involves two scripts:

  • data_import.R: import IMDB data
  • data_prep.R: split samples into word vectors and aggregate the buckets of samples and labels into a list
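The core of that mapping from raw text to index sequences can be sketched as follows (the tiny dictionary and the helper name `to_indices` are hypothetical; the real scripts build `dic` from the full IMDB vocabulary):

```r
# Hypothetical sketch of the pre-processing: tokenize a review, map each
# word to its dictionary index, and reserve 0 for padding/unknown words
# (matching data.mask.element = 0 used by the iterators).
dic <- c("the" = 1L, "movie" = 2L, "was" = 3L, "great" = 4L, "awful" = 5L)

to_indices <- function(text, dic) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  idx <- dic[words]
  idx[is.na(idx)] <- 0L  # out-of-vocabulary words map to the mask element
  unname(idx)
}

seq1 <- to_indices("The movie was great", dic)
# seq1 is c(1, 2, 3, 4)
```

Each index sequence is then routed to a bucket according to its length, exactly as in the padding/bucketing example above.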
#####################################################
### Load preprocessed data
# corpus_bucketed_train <- readRDS(file = "data/corpus_bucketed_train.rds")
# corpus_bucketed_test <- readRDS(file = "data/corpus_bucketed_test.rds")

corpus_bucketed_train <- readRDS(file = "data/corpus_bucketed_train_right_pad.rds")
corpus_bucketed_test <- readRDS(file = "data/corpus_bucketed_test_right_pad.rds")

vocab <- length(corpus_bucketed_test$dic)

### Create iterators
batch.size = 64

train.data <- mx.io.bucket.iter(buckets = corpus_bucketed_train$buckets, batch.size = batch.size, data.mask.element = 0, shuffle = TRUE)

eval.data <- mx.io.bucket.iter(buckets = corpus_bucketed_test$buckets, batch.size = batch.size, data.mask.element = 0, shuffle = FALSE)

Model training

devices <- mx.gpu(0)

initializer <- mx.init.Xavier(rnd_type = "gaussian", factor_type = "avg", magnitude = 2.5)

# optimizer <- mx.opt.create("adadelta", rho = 0.9, epsilon = 1e-5, wd = 1e-5, clip_gradient = NULL, rescale.grad = 1/batch.size)

optimizer <- mx.opt.create("rmsprop", learning.rate = 0.001, gamma1 = 0.95, gamma2 = 0.9, wd = 1e-5, clip_gradient=NULL, rescale.grad=1/batch.size)

batch.end.callback <- mx.callback.log.train.metric(period = 50)
epoch.end.callback <- mx.callback.log.train.metric(period = 1)

model_sentiment_rnn <- mx.rnn.buckets(train.data = train.data, eval.data = eval.data,
                                      num.round = 5, 
                                      ctx = devices, 
                                      metric = mx.metric.accuracy, 
                                      initializer = initializer, 
                                      optimizer = optimizer, 
                                      num.rnn.layer = 1, 
                                      num.embed = 4, 
                                      num.hidden = 8, 
                                      num.label = 2, 
                                      input.size = vocab, 
                                      dropout = 0.25,
                                      cell.type = "gru", 
                                      config = "seq-to-one", 
                                      batch.end.callback = batch.end.callback, 
                                      epoch.end.callback = epoch.end.callback,
                                      verbose = TRUE, 
                                      cudnn = TRUE)
## Start training with 1 devices
## Batch [50] Train-accuracy=0.4965625
## Batch [100] Train-accuracy=0.49921875
## Batch [150] Train-accuracy=0.506458333333333
## Batch [200] Train-accuracy=0.506953125
## Batch [250] Train-accuracy=0.516
## Batch [300] Train-accuracy=0.526666666666667
## Batch [350] Train-accuracy=0.538080357142857
## [1] Train-accuracy=0.548025063451777
## [1] Validation-accuracy=0.693861323155216
## Batch [50] Train-accuracy=0.715625
## Batch [100] Train-accuracy=0.73890625
## Batch [150] Train-accuracy=0.759479166666667
## Batch [200] Train-accuracy=0.778515625
## Batch [250] Train-accuracy=0.7933125
## Batch [300] Train-accuracy=0.804895833333333
## Batch [350] Train-accuracy=0.814330357142857
## [2] Train-accuracy=0.819440038071066
## [2] Validation-accuracy=0.843392175572519
## Batch [50] Train-accuracy=0.88
## Batch [100] Train-accuracy=0.8921875
## Batch [150] Train-accuracy=0.892916666666667
## Batch [200] Train-accuracy=0.89125
## Batch [250] Train-accuracy=0.8925625
## Batch [300] Train-accuracy=0.893125
## Batch [350] Train-accuracy=0.894598214285714
## [3] Train-accuracy=0.894868337563452
## [3] Validation-accuracy=0.877584287531807
## Batch [50] Train-accuracy=0.92875
## Batch [100] Train-accuracy=0.91703125
## Batch [150] Train-accuracy=0.9203125
## Batch [200] Train-accuracy=0.917578125
## Batch [250] Train-accuracy=0.918375
## Batch [300] Train-accuracy=0.916979166666667
## Batch [350] Train-accuracy=0.914866071428571
## [4] Train-accuracy=0.915529822335025
## [4] Validation-accuracy=0.876709605597964
## Batch [50] Train-accuracy=0.94125
## Batch [100] Train-accuracy=0.926875
## Batch [150] Train-accuracy=0.92875
## Batch [200] Train-accuracy=0.92828125
## Batch [250] Train-accuracy=0.9261875
## Batch [300] Train-accuracy=0.92640625
## Batch [350] Train-accuracy=0.925625
## [5] Train-accuracy=0.926316624365482
## [5] Validation-accuracy=0.870388040712468
mx.model.save(model_sentiment_rnn, prefix = "models/model_sentiment_rnn", iteration = 5)

Inference

#####################################################
### Inference
ctx <- list(mx.gpu(0))
model_sentiment <- mx.model.load(prefix = "models/model_sentiment_rnn", iteration = 5)

corpus_bucketed_train <- readRDS(file = "data/corpus_bucketed_train_right_pad.rds")
corpus_bucketed_test <- readRDS(file = "data/corpus_bucketed_test_right_pad.rds")

Inference on train data

source("rnn.infer.R")

###############################################
### Inference on train
batch.size <- 64

train.data <- mx.io.bucket.iter(buckets = corpus_bucketed_train$buckets, batch.size = batch.size, data.mask.element = 0, shuffle = FALSE)

infer_train <- mx.rnn.infer.buckets(infer_iter = train.data, 
                                    model = model_sentiment,
                                    config="seq-to-one",
                                    ctx = ctx, 
                                    cell.type = "gru",
                                    kvstore="local",
                                    cudnn = TRUE,
                                    num.rnn.layer = 1, 
                                    num.hidden = 8)

pred_train <- apply(infer_train$pred, 1, which.max) - 1
label_train <- infer_train$label

acc_train <- sum(pred_train == label_train) / length(label_train)

roc_train <- roc(predictions = infer_train$pred[, 2], labels = factor(label_train))
auc_train <- auc(roc_train)
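As a sanity check on the AUC package result, the AUC can also be computed directly from ranks via the Mann-Whitney formulation (the helper name `auc_manual` is ours, not part of any package): the AUC is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```r
# AUC from ranks (Mann-Whitney U statistic): rank all scores together,
# sum the ranks of the positives, subtract the minimum possible rank sum,
# and normalize by the number of positive/negative pairs.
auc_manual <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  r <- rank(c(pos, neg))
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# Toy check with hand-made scores: perfect separation gives AUC = 1
auc_manual(c(0.9, 0.8, 0.3, 0.1), c(1, 1, 0, 0))
```

On the real predictions, `auc_manual(infer_train$pred[, 2], label_train)` should agree with `auc_train` up to how ties are handled.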

Accuracy: AUC:

Inference on test

###############################################
### Inference on test
test.data <- mx.io.bucket.iter(buckets = corpus_bucketed_test$buckets, batch.size = batch.size, data.mask.element = 0, shuffle = FALSE)

infer_test <- mx.rnn.infer.buckets(infer_iter = test.data, 
                                   model = model_sentiment,
                                   config = "seq-to-one",
                                   ctx = ctx, 
                                   cell.type = "gru",
                                   kvstore = "local",
                                   cudnn = TRUE,
                                   num.rnn.layer = 1, 
                                   num.hidden = 8)

pred_test <- apply(infer_test$pred, 1, which.max) - 1
label_test <- infer_test$label

acc_test <- sum(pred_test == label_test) / length(label_test)

roc_test <- roc(predictions = infer_test$pred[, 2], labels = factor(label_test))
auc_test <- auc(roc_test)

Accuracy: AUC: